SUMMARY

This analysis is based on the premise that widening political divisions are not just about “what to do” but, more fundamentally, about “what the actual problem is.” For instance, even the conservative National Review has highlighted disagreement about which issues are most important.
Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the language of presidential candidates, using texts from recent Republican and Democratic debates. Key findings are:
1. “Wordcloud” visualization shows systematic differences between candidates, though the similarities are more surprising.
2. Word frequency analysis highlights positional differences between candidates, but does not convey important context.
3. Bigram tokenization and word-stem searches do the best job of revealing position differences between candidates.

DATA SOURCES AND METHODS

The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point on, all processing is done in R using the {tm} package and associated libraries.

CANDIDATE WORD-CLOUDS

The quickest and most visual method to compare texts is word-frequency analysis using wordclouds. Not surprisingly, word choices vary significantly between candidates. However, there are also some striking similarities.

Let’s first just look at the word clouds of different candidates. We can address whether there are differences in the word frequencies used by candidates, as emphasized by the algorithm of the {wordcloud} package. Are there differences within the same party, between candidates, etc.?

THE POPULISTS: TRUMP AND SANDERS

Here are the word clouds of Donald Trump’s and Bernie Sanders’s dialogue at the debates. Surprisingly, they share frequent word choices like people, country, and going, as if both are painting a vision of the future that is personal (though radically different in nature).

c_wordcloud(trump_all)

c_wordcloud(sanders_all)

THE WOMEN: HILLARY AND CARLY

In this case the word clouds couldn’t be more different. Hillary’s emphasizes think and people, while that of Carly, a former businesswoman, primarily emphasizes government.

c_wordcloud(clinton_all)

c_wordcloud(fiorina_all)

BRAINS VERSUS BRAWN: CRUZ AND HUCKABEE

Ted Cruz’s wordcloud emphasizes wonkish technicalities, like taxes and washington, while that of Mike Huckabee, a former minister, mixes the populist and a focus on government.

c_wordcloud(cruz_all)

c_wordcloud(huckabee_all)

STAYING ON MESSAGE: COMPARING DEBATES

We can also split the text by debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how comparable the language of each candidate is between the debates. Perhaps the candidates are more interested in staying on message than answering questions directly?

c_wordcloud(candidate_text_tc("TRUMP", r_oct))

c_wordcloud(candidate_text_tc("TRUMP", r_nov))

c_wordcloud(candidate_text_tc("SANDERS", d_oct))

c_wordcloud(candidate_text_tc("SANDERS", d_nov))
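The source does not show how a candidate's text is pulled out of a debate transcript. The following is a minimal base-R sketch of one way it might be done, not the author's candidate_text_tc function. The helper name speaker_text is hypothetical, and the sketch assumes each turn starts with an upper-case label such as "TRUMP:" and that unlabeled lines continue the previous speaker's turn.

```r
# Hypothetical helper: collect everything one speaker says in a transcript.
# Assumption: a turn starts with an upper-case label followed by a colon.
speaker_text <- function(speaker, lines) {
  label_re <- "^[A-Z][A-Z' .-]*:"
  is_label <- grepl(label_re, lines)

  # Labels in order of appearance, e.g. "TRUMP", "SANDERS", ...
  labels <- sub(":.*$", "", lines[is_label])

  # Turn index for every line (0 means "before the first label"),
  # then map each line to the speaker of its turn.
  turn <- cumsum(is_label)
  current <- c(NA_character_, labels)[turn + 1]

  keep <- !is.na(current) & current == speaker
  # Strip the leading label from the lines that carry one.
  text <- sub(paste0(label_re, "[[:space:]]*"), "", lines[keep])
  paste(text, collapse = " ")
}

# Toy transcript (made up for illustration, not taken from the debates)
debate <- c(
  "TRUMP: We need to get the country going.",
  "SANDERS: I believe in the people of this country.",
  "They deserve better.",
  "TRUMP: People know that."
)

trump   <- speaker_text("TRUMP", debate)
sanders <- speaker_text("SANDERS", debate)
print(trump)
print(sanders)
```

Running the same helper on the October and November transcripts separately would give the per-debate texts compared here.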

A NOTE ON WORDCLOUDS

Both visually appealing and interesting, wordclouds reveal differences between candidate word choices which hint at political preferences and opinions, but do not reveal significant detail about them. For instance, differences in positions on policy and important matters like taxes, terrorism, immigration, and class division are not deducible from the wordclouds.

WORD FREQUENCY

We can check word frequency directly by simply tokenizing the text and counting single words. Looking for the most frequent words used by each candidate may reveal some clearer differences. For this analysis, some additional words like “thats”, “dont”, “back”, “can”, “get”, “cant”, and “come” are suppressed.

These tell a bit of a clearer story, i.e. we can almost read these words like sentences or sentence fragments. Both Bernie Sanders and Donald Trump again seem to come across similarly as populists. The first-ranking word for Donald Trump is country and for Bernie Sanders it’s believe. It’s interesting to note that the notion of getting the “country going” comes through in the top three candidates.
More humorously, note that the word “clinton” figures in both Carly Fiorina’s and Hillary Clinton’s lists. These texts almost read as an assertion by Hillary and a counterargument by Carly.
Ted Cruz’s frequent words again seem to focus on business, and Marco Rubio loves America.

While word frequencies reveal differences between the approaches and personalities of the candidates, they don’t by themselves elucidate differences on specific policies or attitudes. Let’s try something else.

##      Row.names trump sanders clinton fiorina all
## 1810    people    33      85      53      10 181
## 2483     think     9      55      90       9 163
## 1050     going    44      44      45      10 143
## 561    country    34      70      25       1 130
## 1363      know    23      26      56      19 124
## 2704      well     9      31      56       8 104
word        trump  sanders  clinton  fiorina  all
think           9       55       90        9  163
know           23       26       56       19  124
well            9       31       56        8  104
people         33       85       53       10  181
government      0        7        6       40   53
every           4       15        9       26   54
need            5       33       36       18   92
country        34       70       25        1  130
going          44       44       45       10  143
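The single-word counts tabulated above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the {tm} pipeline actually used: the tokenize helper is hypothetical, the toy sentences are made up, and the short list of common stop words is an assumption (only the seven suppressed words come from the text).

```r
# Suppressed words named in the text, plus a tiny illustrative stop-word
# list (the second vector is an assumption, not the list the analysis used).
extra_stop <- c("thats", "dont", "back", "can", "get", "cant", "come")
basic_stop <- c("the", "a", "to", "of", "and", "we", "i", "in", "is", "this")
stop_words <- c(extra_stop, basic_stop)

# Hypothetical helper: lower-case, strip punctuation, split on whitespace,
# and drop stop words. Note that "don't" becomes "dont" before filtering.
tokenize <- function(text, stop_words) {
  clean <- gsub("[^a-z[:space:]]", "", tolower(text))
  words <- strsplit(clean, "[[:space:]]+")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}

# Toy texts (made up for illustration)
texts <- c(
  trump   = "We can get the country going. People know the country.",
  sanders = "I believe the people of this country don't get a fair deal."
)
tokens <- lapply(texts, tokenize, stop_words = stop_words)

# One row per word, one column per candidate, plus an "all" total column
vocab <- sort(unique(unlist(tokens)))
freq <- sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab))))
rownames(freq) <- vocab
freq <- cbind(freq, all = rowSums(freq))
freq <- freq[order(-freq[, "all"]), ]
print(freq)
```

Counting against a shared vocabulary (via factor levels) keeps zero counts in the table, which matters when a word used heavily by one candidate is never used by another.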

GRAPHICAL REPRESENTATION

There’s additional information in whether words used frequently by one candidate are used at all by another candidate. While the wordcloud analysis gives some insight, we can analyze the data graphically for more quantitative information. Here is a graph of the “top” words used by all candidates.

NORMALIZED Z STATISTICS

The above doesn’t reveal much more information than the wordcloud analysis does. However, we can also pick some “key words” and sample for their frequency. For a first stab, let’s try

key_words = c("tax", "government", "climate", "class", "wall", "street","terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence")

##       Row.names        trump     sanders      clinton     fiorina all rank
## 1062 government 0.0000000000 0.001622624 0.0012992638 0.025316456  53    1
## 2671       wall 0.0065281899 0.006722299 0.0023819835 0.001265823  53    2
## 2447        tax 0.0071216617 0.002549838 0.0008661758 0.010759494  44    3
## 2377     street 0.0005934718 0.006490496 0.0025985275 0.001265823  43    4
## 1605      money 0.0065281899 0.004404265 0.0006496319 0.004430380  40    5
## 1128     health 0.0000000000 0.004172462 0.0030316154 0.001898734  35    6
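The per-candidate values in the table above look like relative frequencies, i.e. each key-word count divided by that candidate's total number of words. The source does not state the denominator, so that reading is an assumption. A minimal base-R sketch with made-up tokens:

```r
# Toy token vectors (made up for illustration, not debate data)
tokens <- list(
  trump   = c("tax", "wall", "country", "tax", "going"),
  sanders = c("wall", "street", "tax", "people",
              "believe", "country", "wall", "class")
)
key_words <- c("tax", "wall", "government")

# Count each key word per candidate, then divide by that candidate's
# total word count (assumed denominator) so that candidates who spoke
# for different lengths of time can be compared.
rel_freq <- sapply(tokens, function(w) {
  as.vector(table(factor(w, levels = key_words))) / length(w)
})
rownames(rel_freq) <- key_words
print(rel_freq)
```

Normalizing this way keeps a talkative candidate from dominating the comparison simply by having said more words overall.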

WORD ASSOCIATIONS FROM BIGRAM TOKENIZATION

Since word frequency does not convey specific positions on issues, let’s look at word associations to see if we can get closer to meaning from more information about the context of word usage. This analysis simply tokenizes the text as bigrams, then uses a simple function

bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]

to pull out relevant terms from the tokenized TDM (term-document matrix). A key challenge is that the texts are relatively short, so the statistics for comparing word frequencies are poor. Nevertheless, we can see that the context around different words, even at the relatively unsophisticated level of simple bigrams, starts to hint at differences in approach to problems.
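As a rough illustration of the bigram lookup described above, here is a base-R sketch. It is not the tokenizer used with {tm} in the analysis: the bigrams helper is hypothetical, the word vector is made up, and the table is a plain one-dimensional table, so it is indexed by names() rather than rownames().

```r
# Hypothetical helper: pair each word with the word that follows it.
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Toy word sequence (made up for illustration)
words <- c("we", "must", "reform", "the", "tax", "code",
           "and", "simplify", "the", "tax", "code")

# Count every bigram, then keep those that mention the search word.
bigram_table <- table(bigrams(words))
word <- "tax"
hits <- bigram_table[grep(word, names(bigram_table), ignore.case = TRUE)]
print(hits)
```

Because grep matches substrings, a search for "tax" would also pick up bigrams containing "taxes" or "taxpayers", which is what makes this usable as a crude word-stem search.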

CANDIDATES ON TAXES

Bernie Sanders talks about “tax” and “terror” as well. His discussion of taxes has a reformist bent, but where Carly Fiorina associates words like budgeting, changes, reform, simplify, code, and plan, Bernie Sanders associates words like cap, income, must, share, speculation, breaks, reform, wall, and rebuilding.